Mini Project¶
We are going to use the Enron dataset again, this time to learn regressions.
We will try to infer the "bonus" (target) from the "salary" (input) of each employee.
As always, ensure a Python 2 environment first (or modify the code per Udacity's instructions to run under Python 3.x).
# ensuring python version
import sys
sys.version
sys.version_info
#!/usr/bin/python
"""
Starter code for the regression mini-project.
Loads up/formats a modified version of the dataset
(why modified? we've removed some trouble points
that you'll find yourself removing in the outliers mini-project).
Draws a little scatterplot of the training/testing data
You fill in the regression code where indicated:
"""
%matplotlib inline
import sys
import pickle
#sys.path.append("../../tools/")
from feature_format import featureFormat, targetFeatureSplit
dictionary = pickle.load( open("../17. Final Project/final_project_dataset_modified.pkl", "r") )
### list the features you want to look at--first item in the
### list will be the "target" feature
features_list = ["bonus", "salary"]
data = featureFormat( dictionary, features_list, remove_any_zeroes=True)
target, features = targetFeatureSplit( data )
### training-testing split needed in regression, just like classification
from sklearn.cross_validation import train_test_split
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.5, random_state=42)
train_color = "b"
test_color = "r"
### Your regression goes here!
### Please name it reg, so that the plotting code below picks it up and
### plots it correctly. Don't forget to change the test_color above from "b" to
### "r" to differentiate training points from test points.
### draw the scatterplot, with color-coded training and testing points
import matplotlib.pyplot as plt
for feature, target in zip(feature_test, target_test):
    plt.scatter( feature, target, color=test_color )
for feature, target in zip(feature_train, target_train):
    plt.scatter( feature, target, color=train_color )
### labels for the legend
plt.scatter(feature_test[0], target_test[0], color=test_color, label="test")
plt.scatter(feature_train[0], target_train[0], color=train_color, label="train")
### draw the regression line, once it's coded
try:
    plt.plot( feature_test, reg.predict(feature_test) )
except NameError:
    pass
plt.xlabel(features_list[1])
plt.ylabel(features_list[0])
plt.legend()
plt.show()
Slope and Intercept¶
Import LinearRegression from sklearn, and create/fit your regression. Name it reg so that the plotting code will show it overlaid on the scatterplot. Does it fall approximately where you expected it?
Extract the slope (stored in the reg.coef_ attribute) and the intercept. What are the slope and intercept?
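For intuition about what `reg.coef_` and `reg.intercept_` encode, note that for a single input feature the least-squares fit has a closed form: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x). A minimal pure-Python sketch on made-up numbers (not the Enron data):

```python
# Closed-form simple linear regression: the quantities behind
# reg.coef_ (slope) and reg.intercept_ (intercept)
def simple_ols(xs, ys):
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # cov(x, y) and var(x), each scaled by n (the factor cancels in the ratio)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# toy points lying exactly on y = 2x + 1
slope, intercept = simple_ols([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
# slope -> 2.0, intercept -> 1.0
```

sklearn's `LinearRegression` solves the same least-squares problem, just generalized to many features.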
#!/usr/bin/python
"""
Starter code for the regression mini-project.
Loads up/formats a modified version of the dataset
(why modified? we've removed some trouble points
that you'll find yourself removing in the outliers mini-project).
Draws a little scatterplot of the training/testing data
You fill in the regression code where indicated:
"""
%matplotlib inline
import sys
import pickle
#sys.path.append("../../tools/")
from feature_format import featureFormat, targetFeatureSplit
dictionary = pickle.load( open("../17. Final Project/final_project_dataset_modified.pkl", "r") )
### list the features you want to look at--first item in the
### list will be the "target" feature
features_list = ["bonus", "salary"]
data = featureFormat( dictionary, features_list, remove_any_zeroes=True)
target, features = targetFeatureSplit( data )
### training-testing split needed in regression, just like classification
from sklearn.cross_validation import train_test_split
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.5, random_state=42)
train_color = "b"
test_color = "r"
### Your regression goes here!
### Please name it reg, so that the plotting code below picks it up and
### plots it correctly. Don't forget to change the test_color above from "b" to
### "r" to differentiate training points from test points.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(feature_train, target_train)
print reg.coef_
print reg.intercept_
### draw the scatterplot, with color-coded training and testing points
import matplotlib.pyplot as plt
for feature, target in zip(feature_test, target_test):
    plt.scatter( feature, target, color=test_color )
for feature, target in zip(feature_train, target_train):
    plt.scatter( feature, target, color=train_color )
### labels for the legend
plt.scatter(feature_test[0], target_test[0], color=test_color, label="test")
plt.scatter(feature_train[0], target_train[0], color=train_color, label="train")
### draw the regression line, once it's coded
try:
    plt.plot( feature_test, reg.predict(feature_test) )
except NameError:
    pass
plt.xlabel(features_list[1])
plt.ylabel(features_list[0])
plt.legend()
plt.show()
The slope is approximately 5.45 and the intercept approximately -102360.5.
Regression Score: Training Data¶
Imagine you were a less savvy machine learner, and didn't know to test on a holdout test set. Instead, you tested on the same data that you used to train, by comparing the regression predictions to the target values (i.e. bonuses) in the training data. What score do you find?
from sklearn.metrics import r2_score
# predicting the 'bonus' from 'training inputs/salaries'
target_predictions = reg.predict(feature_train)
# comparing predicted 'bonus' with 'training' bonus
score = r2_score(target_train, target_predictions)
score
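As a side note, `reg.score(feature_train, target_train)` returns the same R² value as `r2_score` here. The metric itself is simple: R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the targets from their mean. A pure-Python sketch on toy numbers:

```python
# R^2 = 1 - SS_res / SS_tot, the quantity r2_score (and reg.score) report
def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # perfect predictions -> 1.0
r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])  # one point off by 1 -> 0.5
```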
Regression Score: Test Data¶
Now compute the score for your regression on the test data, like you know you should. What's that score on the testing data?
from sklearn.metrics import r2_score
# predicting the 'bonus' from 'TEST inputs/salaries'
target_predictions = reg.predict(feature_test)
# comparing predicted 'bonus' with 'TEST' bonus
score = r2_score(target_test, target_predictions)
score
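Don't be surprised if the test score comes out negative: R² drops below zero whenever the model's predictions on held-out data are worse than simply predicting the mean of the targets. A quick inline demonstration with made-up numbers:

```python
# A "model" that always predicts 3.0, evaluated against targets [1, 2, 3]
y_true = [1.0, 2.0, 3.0]
y_pred = [3.0, 3.0, 3.0]

mean_y = sum(y_true) / float(len(y_true))
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # 5.0
ss_tot = sum((t - mean_y) ** 2 for t in y_true)              # 2.0
score = 1.0 - ss_res / ss_tot                                # -1.5
```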
Regressing Bonus Against LTI¶
Regress the bonus against the long term incentive, and see if the regression score is significantly higher than regressing the bonus against the salary. Perform the regression of bonus against long term incentive--what's the score on the test data?
Step 1: Redo the regression, this time with the long-term incentive (LTI). Changes are marked with the comment 'CHANGED HERE'.
#!/usr/bin/python
"""
Starter code for the regression mini-project.
Loads up/formats a modified version of the dataset
(why modified? we've removed some trouble points
that you'll find yourself removing in the outliers mini-project).
Draws a little scatterplot of the training/testing data
You fill in the regression code where indicated:
"""
%matplotlib inline
import sys
import pickle
#sys.path.append("../../tools/")
from feature_format import featureFormat, targetFeatureSplit
dictionary = pickle.load( open("../17. Final Project/final_project_dataset_modified.pkl", "r") )
### list the features you want to look at--first item in the
### list will be the "target" feature
features_list = ["bonus", "long_term_incentive"] #CHANGED HERE
data = featureFormat( dictionary, features_list, remove_any_zeroes=True)
target, features = targetFeatureSplit( data )
### training-testing split needed in regression, just like classification
from sklearn.cross_validation import train_test_split
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.5, random_state=42)
train_color = "b"
test_color = "r"
### Your regression goes here!
### Please name it reg, so that the plotting code below picks it up and
### plots it correctly. Don't forget to change the test_color above from "b" to
### "r" to differentiate training points from test points.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(feature_train, target_train)
print reg.coef_
print reg.intercept_
### draw the scatterplot, with color-coded training and testing points
import matplotlib.pyplot as plt
for feature, target in zip(feature_test, target_test):
    plt.scatter( feature, target, color=test_color )
for feature, target in zip(feature_train, target_train):
    plt.scatter( feature, target, color=train_color )
### labels for the legend
plt.scatter(feature_test[0], target_test[0], color=test_color, label="test")
plt.scatter(feature_train[0], target_train[0], color=train_color, label="train")
### draw the regression line, once it's coded
try:
    plt.plot( feature_test, reg.predict(feature_test) )
except NameError:
    pass
plt.xlabel(features_list[1])
plt.ylabel(features_list[0])
plt.legend()
plt.show()
Step 2: Predict and calculate the score on the test data.
from sklearn.metrics import r2_score
# predicting the 'bonus' from 'TEST inputs/long term incentives'
target_predictions = reg.predict(feature_test)
# comparing predicted 'bonus' with 'TEST' bonus
score = r2_score(target_test, target_predictions)
score
Sneak Peek: Outliers Break Regressions¶
Add these two lines near the bottom of finance_regression.py, right before plt.xlabel(features_list[1]):
reg.fit(feature_test, target_test)
plt.plot(feature_train, reg.predict(feature_train), color="b")
(The brightness of the training and test data points has been reduced to make the regression lines more visible.)
#!/usr/bin/python
"""
Starter code for the regression mini-project.
Loads up/formats a modified version of the dataset
(why modified? we've removed some trouble points
that you'll find yourself removing in the outliers mini-project).
Draws a little scatterplot of the training/testing data
You fill in the regression code where indicated:
"""
%matplotlib inline
import sys
import pickle
#sys.path.append("../../tools/")
from feature_format import featureFormat, targetFeatureSplit
dictionary = pickle.load( open("../17. Final Project/final_project_dataset_modified.pkl", "r") )
### list the features you want to look at--first item in the
### list will be the "target" feature
features_list = ["bonus", "salary"]
data = featureFormat( dictionary, features_list, remove_any_zeroes=True)
target, features = targetFeatureSplit( data )
### training-testing split needed in regression, just like classification
from sklearn.cross_validation import train_test_split
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.5, random_state=42)
train_color = '#BBDEFB' #"b"
test_color = '#FFCDD2'#"r"
### Your regression goes here!
### Please name it reg, so that the plotting code below picks it up and
### plots it correctly. Don't forget to change the test_color above from "b" to
### "r" to differentiate training points from test points.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(feature_train, target_train)
print reg.coef_
print reg.intercept_
### draw the scatterplot, with color-coded training and testing points
import matplotlib.pyplot as plt
for feature, target in zip(feature_test, target_test):
    plt.scatter( feature, target, color=test_color )
for feature, target in zip(feature_train, target_train):
    plt.scatter( feature, target, color=train_color )
### labels for the legend
plt.scatter(feature_test[0], target_test[0], color=test_color, label="test")
plt.scatter(feature_train[0], target_train[0], color=train_color, label="train")
### draw the regression line, once it's coded
try:
    plt.plot( feature_test, reg.predict(feature_test) )
except NameError:
    pass
# OUTLIERS
reg.fit(feature_test, target_test)
plt.plot(feature_train, reg.predict(feature_train), color="r")
plt.xlabel(features_list[1])
plt.ylabel(features_list[0])
plt.legend()
plt.show()
What is the slope of the new regression line?
print reg.coef_
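The two fitted lines differ sharply because ordinary least squares minimizes squared errors, so a single extreme point exerts enormous pull on the fit. A small self-contained illustration using `numpy.polyfit` on made-up numbers (not the Enron data):

```python
import numpy as np

# five clean points lying exactly on y = 2x (slope 2, intercept 0)
x_clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_clean = 2.0 * x_clean
m_clean, b_clean = np.polyfit(x_clean, y_clean, 1)  # slope 2.0, intercept 0.0

# add one wild point, like an executive's outsized bonus
x_out = np.append(x_clean, 6.0)
y_out = np.append(y_clean, 100.0)
m_out, b_out = np.polyfit(x_out, y_out, 1)  # slope jumps to ~14.6
```

This is exactly why the outliers mini-project that follows focuses on identifying and removing such points before trusting a regression.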